Robust Overlapping Co-clustering
Abstract
Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. On such datasets, in order to accurately identify meaningful clusters, both non-informative data points and non-discriminative features need to be discarded. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable than one restricted to traditional “one-sided” clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and versatile framework that addresses the problem of efficiently detecting dense, arbitrarily positioned, possibly overlapping co-clusters in a dataset. ROCC works with a large variety of distance measures and different co-cluster definitions, making it applicable to a wide range of real-life datasets. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data than several other prominent approaches proposed for this task. We also point out other interesting applications of the proposed framework in solving challenging clustering problems.

1 Motivation

When clustering certain real-world datasets, it has been observed that only a part of the data forms cohesive clusters. For example, in the case of microarray data, typically only a small subset of the genes cluster well and the rest can be considered non-informative [GG06]. Problems addressed by eCommerce businesses, such as market basket analysis and fraud detection, involve huge, noisy datasets with coherent patterns occurring only in small pockets of the data.
Moreover, for such data, coherent clusters could be arbitrarily positioned in subspaces formed by different, possibly overlapping subsets of features, e.g., different subsets of genes may be correlated across different subsets of experiments in microarray data. Additionally, it is possible that some features may not be relevant to any cluster. Traditional clustering algorithms like k-means do not address either issue, since they assign every data point to a cluster based on a similarity measure computed across all the features. Feature selection and feature clustering [DMK03, DB04] improve clustering results on high-dimensional and noisy datasets, but do not allow clusters existing in different subsets of the feature space to be detected easily. Co-clustering simultaneously clusters the data along multiple axes, e.g., in the case of microarray data it simultaneously clusters the genes as well as the experiments [CC00], and can hence detect clusters existing in different subspaces of the feature space.

In this paper we focus on real-life datasets, where co-clusters are arbitrarily positioned in the data matrix, could be overlapping and are obfuscated by the presence of a large number of irrelevant points. Our goal is to discover dense, arbitrarily positioned and overlapping co-clusters in the data, while simultaneously pruning away non-informative objects and features.

2 Related Work

Density-based clustering algorithms have a motivation similar to our proposed approach and use the notion of local density to cluster only a relevant subset of the data into multiple dense clusters. DBSCAN [EKSX96] and its improved versions such as OPTICS rely on the notion of density to find arbitrarily shaped clusters in large spatial databases in the presence of noise. These approaches, however, do not scale to high-dimensional datasets and are limited to Euclidean or related distance measures.
The One Class Information Bottleneck algorithm [CC04] and the Batch Ball One Class Clustering algorithm [GG05] are efficient and scalable algorithms that can work with a large class of distance measures. However, both algorithms find only a single dense region. The Bregman Bubble Clustering (BBC) technique [GG06] addresses the problem of discovering multiple dense regions in a dataset while discarding the relatively non-coherent parts of the data. BBC provides a robust, scalable framework for clustering only a relevant fraction of the data. However, all of these approaches are developed for one-sided clustering only, where the data points are clustered based on their similarity across the entire set of features.

In contrast, both co-clustering (biclustering) and subspace clustering approaches locate clusters in subspaces of the feature space. The literature in both areas is recent but explosive, so we refer to the surveys and comparative studies in [MO04, PHL04, PSBea06] as good starting points. As we shall see in Section 3, none of the existing methods provide the full set of capabilities that the proposed method provides.
Similar resources
Identifying robust clusters and multi-community nodes by combining top-down and bottom-up approaches to clustering
Biological functions are often realized by groups of interacting molecules or cells. Membership in these groups may overlap when molecules or cells are reused in multiple functions. Traditional clustering methods assign components to no more than one group, and cannot identify multi-community nodes. Technical noise is common in high-throughput biological datasets and further blurs distinctions ...
A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data
Fuzzy co-clustering is a technique that performs simultaneous fuzzy clustering of objects and features. It is known to be suitable for categorizing high-dimensional data, due to its dynamic dimensionality reduction mechanism achieved through simultaneous feature clustering. We introduce a new fuzzy co-clustering algorithm called Heuristic Fuzzy Co-clustering with the Ruspini’s condition (HFCR),...
A Generalized Framework for Mining Arbitrarily Positioned Overlapping Co-clusters
The goal of co-clustering is to simultaneously cluster both rows and columns in a given matrix. Motivated by several applications in text mining, recommendation systems and bioinformatics, different methods have been developed to discover local patterns that cannot be identified by traditional clustering algorithms. In spite of much research in this domain, existing co-clustering algorithms hav...
Model-based Overlapping Co-Clustering
Co-clustering or simultaneous clustering of rows and columns of two-dimensional data matrices, is a data mining technique with various applications such as text clustering and microarray analysis. Most proposed co-clustering algorithms work on the data matrices with special assumptions and they also assume the existence of a number of mutually exclusive row and column clusters, but it is believ...
Predictive Overlapping Co-Clustering
In the past few years co-clustering has emerged as an important data mining tool for two way data analysis. Coclustering is more advantageous over traditional one dimensional clustering in many ways such as, ability to find highly correlated sub-groups of rows and columns. However, one of the overlooked benefits of co-clustering is that, it can be used to extract meaningful knowledge for variou...
Publication date: 2008